{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright 2019 Google LLC\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Purchase Prediction with AutoML Tables\n",
    "\n",
    "<table align=\"left\">\n",
    "  <td>\n",
    "    <a href=\"https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-keras\">\n",
    "      <img src=\"https://cloud.google.com/_static/images/cloud/icons/favicons/onecloud/super_cloud.png\"\n",
    "           alt=\"Google Cloud logo\" width=\"32px\"> Read on cloud.google.com\n",
    "    </a>\n",
    "  </td>\n",
    "  <td>\n",
    "    <a href=\"\">\n",
    "      <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n",
    "    </a>\n",
    "  </td>\n",
    "  <td>\n",
    "    <a href=\"\">\n",
    "      <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n",
    "      View on GitHub\n",
    "    </a>\n",
    "  </td>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "OFJAWue1ss3C"
   },
   "source": [
    "## Overview\n",
    "\n",
    "One of the most common use cases in Marketing is to predict the likelihood of conversion. Conversion could be defined by the marketer as taking a certain action like making a purchase, signing up for a free trial, subscribing to a newsletter, etc. Knowing the likelihood that a marketing lead or prospect will ‘convert’ can enable the marketer to target the lead with the right marketing campaign. This could take the form of remarketing, targeted email campaigns, online offers or other treatments.\n",
    "\n",
    "Here we demonstrate how you can use Bigquery and AutoML Tables to build a supervised binary classification model for purchase prediction.\n",
    "\n",
    "### Dataset\n",
    "\n",
    "The model uses a real dataset from the [Google Merchandise store](https://www.googlemerchandisestore.com) consisting of Google Analytics web sessions.\n",
    "\n",
    "The goal here is to predict the likelihood of a web visitor visiting the online Google Merchandise Store making a purchase on the website during that Google Analytics session. Past web interactions of the user on the store website in addition to information like browser details and geography are used to make this prediction.\n",
    "\n",
    "This is framed as a binary classification model, to label a user during a session as either true (makes a purchase) or false (does not make a purchase). Dataset Details The dataset consists of a set of tables corresponding to Google Analytics sessions being tracked on the Google Merchandise Store. Each table is a single day of GA sessions. More details around the schema can be seen here.\n",
    "\n",
    "You can access the data on BigQuery [here](https://support.google.com/analytics/answer/3437719?hl=en&ref_topic=3416089).\n",
    "\n",
    "## Instructions\n",
    "\n",
    "Here is a list of things to do with AutoML Tables:\n",
    "\n",
    "\n",
    " * Set up your local development environment (optional)\n",
    " * Set Project ID and Compute Region\n",
    " * Authenticate your GCP account\n",
    " * Import Python API SDK and create a Client instance,\n",
    " * Create a dataset instance and import the data.\n",
    " * Create a model instance and train the model.\n",
    " * Evaluating the trained model.\n",
    " * Deploy the model on the cloud for online predictions.\n",
    " * Make online predictions.\n",
    " * Undeploy the model\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "dMoTkf3BVD39"
   },
   "source": [
    "# 1. Before you begin\n",
    "\n",
    "## Project setup\n",
    "\n",
    "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to:\n",
    "* Create a Google Cloud Platform (GCP) project, replace \"your-project\" with your GCP project ID and set local development environment.\n",
    "* Enable billing.\n",
    "* Enable AutoML API.\n",
    "* Enter your project ID in the cell below. Then run the  cell to make sure the\n",
    "\n",
    "**If you are using Colab or AI Platform Notebooks**, your environment already meets\n",
    "all the requirements to run this notebook. You can skip this step from the AutoML Tables documentation\n",
    "\n",
    "Cloud SDK uses the right project for all the commands in this notebook.\n",
    "\n",
    "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "yXZlxqICsMg2"
   },
   "outputs": [],
   "source": [
    "PROJECT_ID = \"<your-project>\" # @param {type:\"string\"}\n",
    "COMPUTE_REGION = \"us-central1\" # Currently only supported region.\n",
    "! gcloud config set project $PROJECT_ID"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "---\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "KAg-2-BQ4un6"
   },
   "source": [
    "This section runs initialization and authentication. It creates an authenticated session which is required for running any of the following sections."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "gGuRq4DI47hj"
   },
   "source": [
    "## Authenticate your GCP account\n",
    "\n",
    "**If you are using AI Platform Notebooks**, your environment is already\n",
    "authenticated. Skip this step."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**If you are using Colab**, run the cell below and follow the instructions\n",
    "when prompted to authenticate your account via oAuth.\n",
    "\n",
    "**Otherwise**, follow these steps:\n",
    "\n",
    "1. In the GCP Console, go to the [**Create service account key**\n",
    "   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).\n",
    "\n",
    "2. From the **Service account** drop-down list, select **New service account**.\n",
    "\n",
    "3. In the **Service account name** field, enter a name.\n",
    "\n",
    "4. From the **Role** drop-down list, select\n",
    "   **AutoML > AutoML Admin** and\n",
    "   **Storage > Storage Object Admin**.\n",
    "\n",
    "5. Click *Create*. A JSON file that contains your key downloads to your\n",
    "local environment.\n",
    "\n",
    "6. Enter the path to your service account key as the\n",
    "`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "m3j1Kl4osNaJ"
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "\n",
    "# If you are running this notebook in Colab, run this cell and follow the\n",
    "# instructions to authenticate your GCP account. This provides access to your\n",
    "# Cloud Storage bucket and lets you submit training jobs and prediction\n",
    "# requests.\n",
    "\n",
    "if 'google.colab' in sys.modules:    \n",
    "  from google.colab import files\n",
    "  keyfile_upload = files.upload()\n",
    "  keyfile = list(keyfile_upload.keys())[0]\n",
    "  %env GOOGLE_APPLICATION_CREDENTIALS $keyfile\n",
    "# If you are running this notebook locally, replace the string below with the\n",
    "# path to your service account key and run this cell to authenticate your GCP\n",
    "# account.\n",
    "else:\n",
    "  %env GOOGLE_APPLICATION_CREDENTIALS /path/to/service_account.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9zuplbargStJ"
   },
   "source": [
    "## Install the client library\n",
    "Run the following cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "id": "KIdmobtSsPj8",
    "outputId": "14c234ca-5070-4301-a48c-c69d16ae4c31"
   },
   "outputs": [],
   "source": [
    "%pip install --quiet google-cloud-automl"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import libraries and define constants\n",
    "\n",
    "First, import Python libraries required for training,\n",
    "The code example below demonstrates importing the AutoML Python API module into a python script. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# AutoML library\n",
    "from google.cloud import automl_v1beta1 as automl\n",
    "\n",
    "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Restart the kernel to allow automl_v1beta1 to be imported for Jupyter Notebooks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.core.display import HTML\n",
    "HTML(\"<script>Jupyter.notebook.kernel.restart()</script>\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create API Client to AutoML Service*\n",
    "\n",
    "**If you are using AI Platform Notebooks**, or *Colab* environment is already\n",
    "authenticated using GOOGLE_APPLICATION_CREDENTIALS. Run this step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**If you are using Colab or Jupyter**, and you have defined a service account\n",
    "follow the following steps to create the AutoML client\n",
    "\n",
    "You can see a different way to create the API Clients using service account."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# from google.oauth2 import service_account\n",
    "# credentials = service_account.Credentials.from_service_account_file('/path/to/service_account.json')\n",
    "# client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION, credentials=credentials)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Storage setup\n",
    "\n",
    "You also need to upload your data into [Google Cloud Storage](https://cloud.google.com/storage/) (GCS) or [BigQuery](https://cloud.google.com/bigquery/). \n",
    "For example, to use GCS as your data source:\n",
    "\n",
    "* [Create a GCS bucket](https://cloud.google.com/storage/docs/creating-buckets).\n",
    "* Upload the training and batch prediction files."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "---\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "e1fYDBjDgYEB"
   },
   "source": [
    "# 3. Import, clean, transform and perform feature engineering on the training Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "dYoCTvaAgZK2"
   },
   "source": [
    "### Create dataset in AutoML Tables\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "uPRPqyw2gebp"
   },
   "source": [
    "Select a dataset display name and pass your table source information to create a new dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Iu3KNlcwsRhN"
   },
   "outputs": [],
   "source": [
    "#@title Create dataset { vertical-output: true, output-height: 200 }\n",
    "\n",
    "dataset_display_name = 'colab_trial1' #@param {type: 'string'}\n",
    "\n",
    "dataset = client.create_dataset(dataset_display_name)\n",
    "dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "iTT5N97D0YPo"
   },
   "source": [
    "Create a bucket to store the training data in"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "RQuGIbyGgud9"
   },
   "outputs": [],
   "source": [
    "#@title Create bucket to store data in { vertical-output: true, output-height: 200 }\n",
    "\n",
    "bucket_name = '<your-bucket>' #@param {type: 'string'}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "IQJuy1-PpF3b"
   },
   "source": [
    "### Import Dependencies\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "zzCeDmnnQRNy"
   },
   "outputs": [],
   "source": [
    "!sudo pip install google-cloud-bigquery google-cloud-storage pandas pandas-gbq gcsfs oauth2client\n",
    "\n",
    "import datetime\n",
    "import pandas as pd\n",
    "\n",
    "import gcsfs\n",
    "from google.cloud import bigquery\n",
    "from google.cloud import storage\n",
    "\n",
    "client_bq = bigquery.Client(location='US', project=PROJECT_ID)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "UR5n1crIpQuX"
   },
   "source": [
    "### Transformation and Feature Engineering Functions\n",
    "\n",
    "The data cleaning and transformation step was by far the most involved. It includes a few sections that create an AutoML tables dataset, pull the Google merchandise store data from BigQuery, transform the data, and save it multiple times to csv files in google cloud storage.\n",
    "\n",
    "The dataset that is made viewable in the AutoML Tables UI. It will eventually hold the training data after that training data is cleaned and transformed.\n",
    "\n",
    "This dataset has only around 1% of its values with a positive label value of True i.e. cases when a transaction was made. This is a class imbalance problem. There are several ways to handle class imbalance. We chose to oversample the positive class by random over sampling. This resulted in an artificial increase in the sessions with the positive label of true transaction value.\n",
    "\n",
    "There were also many columns with either all missing or all constant values. These columns would not add any signal to our model, so we dropped them.\n",
    "\n",
    "There were also columns with NaN rather than 0 values. For instance, rather than having a count of 0, a column might have a null value. So we added code to change some of these null values to 0, specifically in our target column, in which null values were not allowed by AutoML Tables. However, AutoML Tables can handle null values for the features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Feature Engineering\n",
    "\n",
    "The dataset had rich information on customer location and behavior; however, it can be improved by performing feature engineering. Moreover, there was a concern about data leakage. The decision to do feature engineering, therefore, had two contributing motivations: remove data leakage without too much loss of useful data, and to improve the signal in our data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Weekdays\n",
    "\n",
    "The date seemed like a useful piece of information to include, as it could capture seasonal effects. Unfortunately, we only had one year of data, so seasonality on an annual scale would be difficult (read impossible) to incorporate. Fortunately, we could try and detect seasonal effects on a micro, with perhaps equally informative results. We ended up creating a new column of weekdays out of dates, to denote which day of the week the session was held on. This new feature turned out to have some useful predictive power, when added as a variable into our model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Data Leakage\n",
    "\n",
    "The marginal gain from adding a weekday feature, was overshadowed by the concern of data leakage in our training data. In the initial naive models we trained, we got outstanding results. So outstanding that we knew that something must be going on. As it turned out, quite a few features functioned as proxies for the feature we were trying to predict: meaning some of the features we conditioned on to build the model had an almost 1:1 correlation with the target feature. Intuitively, this made sense."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One feature that exhibited this behavior was the number of page views a customer made during a session. By conditioning on page views in a session, we could very reliably predict which customer sessions a purchase would be made in. At first this seems like the golden ticket, we can reliably predict whether or not a purchase is made! The catch: the full page view information can only be collected at the end of the session, by which point we would also have whether or not a transaction was made. Seen from this perspective, collecting page views at the same time as collecting the transaction information would make it pointless to predict the transaction information using the page views information, as we would already have both. One solution was to drop page views as a feature entirely. This would safely stop the data leakage, but we would lose some critically useful information. Another solution, (the one we ended up going with), was to track the page view information of all previous sessions for a given customer, and use it to inform the current session. This way, we could use the page view information, but only the information that we would have before the session even began. So we created a new column called previous_views, and populated it with the total count of all previous page views made by the customer in all previous sessions. We then deleted the page views feature, to stop the data leakage.\n",
    "\n",
    "Our rationale for this change can be boiled down to the concise heuristic: only use the information that is available to us on the first click of the session. Applying this reasoning, we performed similar data engineering on other features which we found to be proxies for the label feature. We also refined our objective in the process: For a visit to the Google Merchandise store, what is the probability that a customer will make a purchase, and can we calculate this probability the moment the customer arrives? By clarifying the question, we both made the result more powerful/useful, and eliminated the data leakage that threatened to make the predictive power trivial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "RODZJaq4o9b5"
   },
   "outputs": [],
   "source": [
    "def balanceTable(table):\n",
    "    #class count\n",
    "  count_class_false, count_class_true = table.totalTransactionRevenue.value_counts()\n",
    "\n",
    "    #divide by class\n",
    "  table_class_false = table[table[\"totalTransactionRevenue\"] == False]\n",
    "  table_class_true = table[table[\"totalTransactionRevenue\"] == True]\n",
    "\n",
    "    #random over-sampling\n",
    "  table_class_true_over = table_class_true.sample(count_class_false, replace = True)\n",
    "  table_test_over = pd.concat([table_class_false, table_class_true_over])\n",
    "  return table_test_over\n",
    "\n",
    "\n",
    "def partitionTable(table, dt=20170500):\n",
    "  #the automl tables model could be training on future data and implicitly learning about past data in the testing\n",
    "  #dataset, this would cause data leakage. To prevent this, we are training only with the first 9 months of data (table1)\n",
    "  #and doing validation with the last three months of data (table2).\n",
    "  table1 = table[table[\"date\"] <= dt]\n",
    "  table2 = table[table[\"date\"] > dt]\n",
    "  return table1, table2\n",
    "\n",
    "def N_updatePrevCount(table, new_column, old_column):\n",
    "  table = table.fillna(0)\n",
    "  table[new_column] = 1\n",
    "  table.sort_values(by=['fullVisitorId','date'])\n",
    "  table[new_column] = table.groupby(['fullVisitorId'])[old_column].apply(lambda x: x.cumsum())\n",
    "  table.drop([old_column], axis = 1, inplace = True)\n",
    "  return table\n",
    "\n",
    "\n",
    "def N_updateDate(table):\n",
    "  table['weekday'] = 1\n",
    "  table['date'] = pd.to_datetime(table['date'].astype(str), format = '%Y%m%d')\n",
    "  table['weekday'] = table['date'].dt.dayofweek\n",
    "  return table\n",
    "\n",
    "\n",
    "def change_transaction_values(table):\n",
    "  table['totalTransactionRevenue'] = table['totalTransactionRevenue'].fillna(0)\n",
    "  table['totalTransactionRevenue'] = table['totalTransactionRevenue'].apply(lambda x: x!=0)\n",
    "  return table\n",
    "\n",
    "def saveTable(table, csv_file_name, bucket_name):\n",
    "  table.to_csv(csv_file_name, index = False)\n",
    "  storage_client = storage.Client()\n",
    "  bucket = storage_client.get_bucket(bucket_name)\n",
    "  blob = bucket.blob(csv_file_name)\n",
    "  blob.upload_from_filename(filename = csv_file_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "2eGAIUmRqjqX"
   },
   "source": [
    "### Import training data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "XTmXPMUsTgEs"
   },
   "source": [
    "You also have the option of just downloading the file, FULL.csv, [here](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/trial_for_c4m/FULL.csv), instead of running the code below. Just be sure to move the file into the google cloud storage bucket you specified above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "Bl9-DSjIqj7c"
   },
   "outputs": [],
   "source": [
    "#@title Input name of file to save data to { vertical-output: true, output-height: 200 }\n",
    "query = \"\"\"\n",
    "SELECT\n",
    " date, \n",
    " device, \n",
    " geoNetwork, \n",
    " totals, \n",
    " trafficSource, \n",
    " fullVisitorId \n",
    "FROM \n",
    " `bigquery-public-data.google_analytics_sample.ga_sessions_*`\n",
    "WHERE\n",
    " _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 366 DAY)) AND\n",
    " FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 1 DAY))\n",
    "\"\"\"\n",
    "df = client_bq.query(query).to_dataframe()\n",
    "print(df.iloc[:3])\n",
    "path_to_data_pre_transformation = \"FULL.csv\" #@param {type: 'string'}\n",
    "saveTable(df, path_to_data_pre_transformation, bucket_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "V5WK71tiq-2b"
   },
   "source": [
    "### Unnest the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"\"\"\n",
    "SELECT\n",
    " date, \n",
    " device, \n",
    " geoNetwork, \n",
    " totals, \n",
    " trafficSource, \n",
    " fullVisitorId \n",
    "FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`\n",
    "WHERE\n",
    "_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 366 DAY)) AND\n",
    "FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 1 DAY))\n",
    "\"\"\"\n",
    "df = client_bq.query(query).to_dataframe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "RFpgLfeNqUBk"
   },
   "outputs": [],
   "source": [
    "#some transformations on the basic dataset\n",
    "#@title Input the name of file to hold the unnested data to { vertical-output: true, output-height: 200 }\n",
    "unnested_file_name = \"FULL_unnested.csv\" #@param {type: 'string'}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "2dyJlNAVqXUn"
   },
   "source": [
    "You also have the option of just downloading the file, FULL_unnested.csv, [here](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/trial_for_c4m/FULL_unnested.csv), instead of running the code below. Just be sure to move the file into the google cloud storage bucket you specified above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "tLPHeF2Y2l5l"
   },
   "outputs": [],
   "source": [
    "\n",
    "table = pd.read_csv(\"gs://\"+bucket_name+\"/\"+unnested_file_name, low_memory=False)\n",
    "\n",
    "column_names = ['device', 'geoNetwork','totals', 'trafficSource']\n",
    "\n",
    "for name in column_names:\n",
    "  print(name)\n",
    "  table[name] = table[name].apply(lambda i: dict(eval(i)))\n",
    "  temp = table[name].apply(pd.Series)\n",
    "  table = pd.concat([table, temp], axis=1).drop(name, axis=1)\n",
    "\n",
    "#need to drop a column\n",
    "table.drop(['adwordsClickInfo'], axis = 1, inplace = True)\n",
    "saveTable(table, unnested_file_name, bucket_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9_WC-AJLsdqo"
   },
   "source": [
    "### Run the Transformations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 272
    },
    "colab_type": "code",
    "id": "YWQ4462vnpOg",
    "outputId": "5ca7e95a-e0f2-48c2-9b59-8f043d233bd2"
   },
   "outputs": [],
   "source": [
    "table = pd.read_csv(\"gs://\"+bucket_name+\"/\"+unnested_file_name, low_memory=False)\n",
    "\n",
    "consts = ['transactionRevenue', 'transactions', 'adContent', 'browserSize', 'campaignCode', \n",
    "'cityId', 'flashVersion', 'javaEnabled', 'language', 'latitude', 'longitude', 'mobileDeviceBranding', \n",
    "'mobileDeviceInfo', 'mobileDeviceMarketingName','mobileDeviceModel','mobileInputSelector', 'networkLocation', \n",
    "'operatingSystemVersion', 'screenColors', 'screenResolution', 'screenviews', 'sessionQualityDim', 'timeOnScreen',\n",
    "'visits', 'uniqueScreenviews', 'browserVersion','referralPath','fullVisitorId', 'date']\n",
    "\n",
    "table = N_updatePrevCount(table, 'previous_views', 'pageviews')\n",
    "table = N_updatePrevCount(table, 'previous_hits', 'hits')\n",
    "table = N_updatePrevCount(table, 'previous_timeOnSite', 'timeOnSite')\n",
    "table = N_updatePrevCount(table, 'previous_Bounces', 'bounces')\n",
    "\n",
    "table = change_transaction_values(table)\n",
    "\n",
    "table1, table2 = partitionTable(table)\n",
    "table1 = N_updateDate(table1)\n",
    "table2 = N_updateDate(table2)\n",
    "#validation_unnested_FULL.csv = the last 3 months of data\n",
    "\n",
    "table1.drop(consts, axis = 1, inplace = True)\n",
    "table2.drop(consts, axis = 1, inplace = True)\n",
    "\n",
    "saveTable(table2,'validation_unnested_FULL.csv', bucket_name)\n",
    "\n",
    "table1 = balanceTable(table1)\n",
    "\n",
    "#training_unnested_FULL.csv = the first 9 months of data\n",
    "saveTable(table1, 'training_unnested_balanced_FULL.csv', bucket_name)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "LqmARBnRHWh8"
   },
   "outputs": [],
   "source": [
    "#@title ... take the data source from GCS { vertical-output: true } \n",
    "\n",
    "dataset_gcs_input_uris = ['gs://{}/training_unnested_balanced_FULL.csv'.format(bucket_name),] #@param\n",
    "import_data_operation = client.import_data(\n",
    "    dataset=dataset,\n",
    "    gcs_input_uris=dataset_gcs_input_uris\n",
    ")\n",
    "print('Dataset import operation: {}'.format(import_data_operation))\n",
    "\n",
    "# Synchronous check of operation status. Wait until import is done.\n",
    "import_data_operation.result()\n",
    "dataset = client.get_dataset(dataset_name=dataset.name)\n",
    "dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "---\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "W3SiSLS4tml9"
   },
   "source": [
    "# 4. Update dataset: assign a label column and enable nullable columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "jVo8Z8PGtpB7"
   },
   "source": [
    "AutoML Tables automatically detects your data column type. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 255
    },
    "colab_type": "code",
    "id": "dMdOoFsXxyxj",
    "outputId": "e6fab957-2316-48c0-be66-1bff9dc5c23c"
   },
   "outputs": [],
   "source": [
    "# List table specs\n",
    "list_table_specs_response = client.list_table_specs(dataset=dataset)\n",
    "table_specs = [s for s in list_table_specs_response]\n",
    "\n",
    "# List column specs\n",
    "list_column_specs_response = client.list_column_specs(dataset=dataset)\n",
    "column_specs = {s.display_name: s for s in list_column_specs_response}\n",
    "\n",
    "# Print Features and data_type:\n",
    "\n",
    "features = [(key, data_types.TypeCode.Name(value.data_type.type_code)) for key, value in column_specs.items()]\n",
    "print('Feature list:\\n')\n",
    "for feature in features:\n",
    "    print(feature[0],':', feature[1])\n",
    "    \n",
    "# Table schema pie chart.\n",
    "\n",
    "type_counts = {}\n",
    "for column_spec in column_specs.values():\n",
    "  type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)\n",
    "  type_counts[type_name] = type_counts.get(type_name, 0) + 1\n",
    "    \n",
    "plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')\n",
    "plt.axis('equal')\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "AfT4upKysamH"
   },
   "outputs": [],
   "source": [
    "#@title Update a column: set to not nullable { vertical-output: true }\n",
    "\n",
    "update_column_response = client.update_column_spec(\n",
    "    dataset=dataset,\n",
    "    column_spec_display_name='totalTransactionRevenue',\n",
    "    nullable=False,\n",
    ")\n",
    "update_column_response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "3O9cFko3t3ai"
   },
   "source": [
    "**Tip:** You can use kwarg `type_code='CATEGORY'` in the preceding `update_column_spec(..)` call to convert the column data type from `FLOAT64` `to `CATEGORY`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "rR2RaPP7t6y8"
   },
   "source": [
    "### Update dataset: assign a target column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "aTt2mIzbsduV"
   },
   "outputs": [],
   "source": [
    "#@title Update dataset { vertical-output: true }\n",
    "\n",
    "update_dataset_response = client.set_target_column(\n",
    "    dataset=dataset,\n",
    "    column_spec_display_name='totalTransactionRevenue',\n",
    ")\n",
    "update_dataset_response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "xajewSavt9K1"
   },
   "source": [
    "# 5. Creating a model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "dA-FE6iWt-A_"
   },
   "source": [
    "### Train a model\n",
    "\n",
    "To create the datasets for training, testing and validation, we first had to consider what kind of data we were dealing with. The data we had keeps track of all customer sessions with the Google Merchandise store over a year. AutoML tables does its own training and testing, and delivers a quite nice UI to view the results in. For the training and testing dataset then, we simply used the over sampled, balanced dataset created by the transformations described above. But we first partitioned the dataset to include the first 9 months in one table and the last 3 in another. This allowed us to train and test with an entirely different dataset that what we used to validate.\n",
    "\n",
    "Moreover, we held off on oversampling for the validation dataset, to not bias the data that we would ultimately use to judge the success of our model.\n",
    "\n",
    "The decision to divide the sessions along time was made to avoid the model training on future data to predict past data. (This can be avoided with a datetime variable in the dataset and by toggling a button in the UI)\n",
    "\n",
    "Training the model may take one hour or more. The following cell keeps running until the training is done. If your Colab times out, use `client.list_models()` to check whether your model has been created. Then use model name to continue to the next steps. Run the following command to retrieve your model. Replace `model_name` with its actual value.\n",
    "\n",
    "    model = client.get_model(model_name=model_name)\n",
    "    \n",
    "Note that we trained on the first 9 months of data and we validate using the last 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Kp0gGkp8H3zj"
   },
   "outputs": [],
   "source": [
    "#@title Create model { vertical-output: true }\n",
    "#this will create a model that can be access through the auto ml tables colab\n",
    "model_display_name = 'trial_1' #@param {type:'string'}\n",
    "\n",
    "create_model_response = client.create_model(\n",
    "    model_display_name,\n",
    "    dataset=dataset,\n",
    "    train_budget_milli_node_hours=1000,\n",
    ")\n",
    "print('Create model operation: {}'.format(create_model_response.operation))\n",
    "# Wait until model training is done.\n",
    "model = create_model_response.result()\n",
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "---\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tCIk1e4UuDxZ"
   },
   "source": [
    "# 6. Make a prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "H7Fi5f9zuG5f"
   },
   "source": [
    "In this section, we take our validation data prediction results and plot the Precision Recall Curve and the ROC curve of both the false and true predictions.\n",
    "\n",
    "There are two different prediction modes: online and batch. The following cell shows you how to make a batch prediction. Please replace the 'your_test_bucket' of the gcs_destination with your own bucket where the predictions results will be stored by AutoML Tables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "AZ_CPff77m4e"
   },
   "outputs": [],
   "source": [
    "#@title Start batch prediction { vertical-output: true, output-height: 200 }\n",
    "\n",
    "batch_predict_gcs_input_uris = ['gs://cloud-ml-data-tables/notebooks/validation_unnested_FULL.csv',] #@param\n",
    "batch_predict_gcs_output_uri_prefix = 'gs://{}'.format(bucket_name) #@param {type:'string'}\n",
    "batch_predict_response = client.batch_predict(\n",
    "    model=model, \n",
    "    gcs_input_uris=batch_predict_gcs_input_uris,\n",
    "    gcs_output_uri_prefix=batch_predict_gcs_output_uri_prefix,\n",
    ")\n",
    "print('Batch prediction operation: {}'.format(batch_predict_response.operation))\n",
    "# Wait until batch prediction is done.\n",
    "batch_predict_result = batch_predict_response.result()\n",
    "batch_predict_response.metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "utGPmXI-uKNr"
   },
   "source": [
    "# 7. Evaluate your prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "GsOdhJeauTC3"
   },
   "source": [
    "The follow cell creates a Precision Recall Curve and a ROC curve for both the true and false classifications.\n",
    "Fill in the batch_predict_results_location with the location of the results.csv file created in the previous \"Make a prediction\" step\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "orejkh0CH4mu"
   },
   "outputs": [],
   "source": [
    "\n",
    "import numpy as np\n",
    "from sklearn import metrics\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def invert(x):\n",
    "  return 1-x\n",
    "\n",
    "def switch_label(x):\n",
    "  return(not x)\n",
    "batch_predict_results_location = 'gs://<your-test-bucket>/' #@param {type:'string'}\n",
    "\n",
    "table = pd.read_csv(batch_predict_results_location +'/<resultid>/tables_1.csv')\n",
    "y = table[\"totalTransactionRevenue\"]\n",
    "scores = table[\"totalTransactionRevenue_1.0_score\"]\n",
    "scores_invert = table['totalTransactionRevenue_0.0_score']\n",
    "\n",
    "#code for ROC curve, for true values\n",
    "fpr, tpr, thresholds = metrics.roc_curve(y, scores)\n",
    "roc_auc = metrics.auc(fpr, tpr)\n",
    "\n",
    "plt.figure()\n",
    "lw = 2\n",
    "plt.plot(fpr, tpr, color='darkorange',\n",
    "         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)\n",
    "plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\n",
    "plt.xlim([0.0, 1.0])\n",
    "plt.ylim([0.0, 1.05])\n",
    "plt.xlabel('False Positive Rate')\n",
    "plt.ylabel('True Positive Rate')\n",
    "plt.title('Receiver operating characteristic for True')\n",
    "plt.legend(loc=\"lower right\")\n",
    "plt.show()\n",
    "\n",
    "\n",
    "#code for ROC curve, for false values\n",
    "plt.figure()\n",
    "lw = 2\n",
    "label_invert = y.apply(switch_label)\n",
    "fpr, tpr, thresholds = metrics.roc_curve(label_invert, scores_invert)\n",
    "plt.plot(fpr, tpr, color='darkorange',\n",
    "         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)\n",
    "plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\n",
    "plt.xlim([0.0, 1.0])\n",
    "plt.ylim([0.0, 1.05])\n",
    "plt.xlabel('False Positive Rate')\n",
    "plt.ylabel('True Positive Rate')\n",
    "plt.title('Receiver operating characteristic for False')\n",
    "plt.legend(loc=\"lower right\")\n",
    "plt.show()\n",
    "\n",
    "\n",
    "#code for PR curve, for true values\n",
    "\n",
    "precision, recall, thresholds = metrics.precision_recall_curve(y, scores)\n",
    "\n",
    "\n",
    "plt.figure()\n",
    "lw = 2\n",
    "plt.plot( recall, precision, color='darkorange',\n",
    "         lw=lw, label='Precision recall curve for True')\n",
    "plt.xlim([0.0, 1.0])\n",
    "plt.ylim([0.0, 1.05])\n",
    "plt.xlabel('Recall')\n",
    "plt.ylabel('Precision')\n",
    "plt.title('Precision Recall Curve for True')\n",
    "plt.legend(loc=\"lower right\")\n",
    "plt.show()\n",
    "\n",
    "#code for PR curve, for false values\n",
    "\n",
    "precision, recall, thresholds = metrics.precision_recall_curve(label_invert, scores_invert)\n",
    "print(precision.shape)\n",
    "print(recall.shape)\n",
    "\n",
    "plt.figure()\n",
    "lw = 2\n",
    "plt.plot( recall, precision, color='darkorange',\n",
    "          label='Precision recall curve for False')\n",
    "plt.xlim([0.0, 1.1])\n",
    "plt.ylim([0.0, 1.1])\n",
    "plt.xlabel('Recall')\n",
    "plt.ylabel('Precision')\n",
    "plt.title('Precision Recall Curve for False')\n",
    "plt.legend(loc=\"lower right\")\n",
    "plt.show()\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "name": "colab_C4M.ipynb",
   "provenance": [],
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
